K-Means Clustering - A Sparse Version of Principal Component Analysis
Introduction
The K-Means algorithm is one of the most popular clustering techniques. It groups similar data points into clusters, helping surface hidden patterns. For example, by analyzing sales data, customers can be grouped into segments to better understand their behaviour. In an earlier post, we explored Principal Component Analysis, a dimension-reduction tool that transforms a number of (possibly) correlated columns into a (smaller) number of uncorrelated columns called principal components.
While principal components can help rank the records of a dataset, K-Means simply assigns the records to groups, for example good/bad. In that sense, K-Means is a sparse version of PCA. We explore this idea with the European employment dataset by grouping similar countries. The principal components are then combined with the results of K-Means to see the similarities between the two techniques.
Required Packages
The packages listed below in the code chunk are required to replicate this project and generate this HTML document.
library(tidyverse) # The gas for the caR
library(cluster) # Cluster Analysis
library(factoextra) # Visualizing Clusters
library(plotly) # Interactive Plots
library(gridExtra) # Plot graphs in grids
library(DT) # Print HTML Tables
library(knitr) # Generating HTML document
library(rmdformats) # Document theme

Data
The dataset contains the distribution of employment in different sectors of 26 European countries. There are ten variables, the first being the name of the country, and the remaining nine holding the percentage employment of the countries’ citizens in nine different industries.
data_definition <- data.frame(variable = c("Country", "Agr", "Min", "Man", "PS",
"Con", "SI", "Fin", "SPS", "TC"),
description = c("Name of the country",
"Percentage employed in agriculture",
"Percentage employed in mining",
"Percentage employed in manufacturing",
"Percentage employed in power supply industries",
"Percentage employed in construction",
"Percentage employed in service industries",
"Percentage employed in finance",
"Percentage employed in social and personal services",
"Percentage employed in transport and communications"))
data_definition %>% datatable(caption = "Variable Descriptions")

The dataset contains the USSR and East Germany, so I'm assuming the data is from the 1980s or earlier. The data is read in, and a box plot of the distribution of employment in each industry is presented below.
data <- read.table("https://raw.githubusercontent.com/mr-hn/kmeans-pca/master/european_jobs2.txt",
header = TRUE, sep = "\t")
rownames(data) <- data[,"Country"]
data <- data[,-1]
data %>% datatable(caption = "The Raw Data")

gather(data, "industry", "percentage") %>%
plot_ly(x = ~industry, y = ~percentage, type = "box",
hoverinfo = "text",
text = ~paste0(rownames(data), " - ", percentage,"%"),
line = list(color = "rgb(159, 33, 66)"),
marker = list(color = "rgb(33, 17, 3)")) %>%
layout(title = "Distribution of employments in different sectors - Europe, 1980\'s",
xaxis = list(title = "Industries"),
yaxis = list(title = "")) %>%
config(displayModeBar = FALSE)

K-Means Basics
K-Means clustering is one of the most commonly used unsupervised machine learning algorithms. It groups data into a pre-specified number of clusters, K, assigning data points to clusters in such a way that the points within each cluster are as similar as possible. Each cluster is represented by a centroid, the mean of all the data points assigned to it. The within-cluster similarity is measured through the sum of squared distances between the centroid and the data points assigned to it.
The within-cluster sum of squares for cluster \(C_k\) is
\[W(C_k) = \sum_{x_{ik} \in C_k} (x_{ik} - \mu_k)^2\]
where
- \(C_k\) is the kth cluster
- \(x_{ik}\) is a data point belonging to cluster \(C_k\)
- \(\mu_k\) is the mean of all data points in cluster \(C_k\)

K-Means minimizes the total of \(W(C_k)\) over all K clusters.
For reference, this excellent post by Brad Boehmke goes into a lot more detail.
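For intuition, here is a minimal sketch of Lloyd's algorithm, the classic way of solving the K-Means problem, in base R. It is for illustration only: in practice use stats::kmeans(), which adds random restarts (the nstart argument used throughout this post) and guards against empty clusters. The toy data and the helper lloyd_kmeans() are made up for this example.

```r
set.seed(1804)

# Toy data: two well-separated Gaussian blobs in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))

lloyd_kmeans <- function(x, k, iters = 100) {
  # Deterministic initialization: k rows spread across the data.
  # (stats::kmeans instead samples random rows, repeated nstart times.)
  centroids <- x[round(seq(1, nrow(x), length.out = k)), , drop = FALSE]
  for (i in seq_len(iters)) {
    # Squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centroids[j, ])^2))
    cluster <- max.col(-d)  # assign each point to its nearest centroid
    # Recompute each centroid as the mean of its assigned points
    # (assumes no cluster ends up empty, which kmeans() handles properly)
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    if (all(abs(new_centroids - centroids) < 1e-12)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

fit <- lloyd_kmeans(x, k = 2)
table(fit$cluster)  # the two blobs of 20 points each are recovered
```

With separation this large, the algorithm converges in a couple of passes; on real data the result depends on initialization, which is why the rest of this post calls kmeans() with nstart = 25.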
The K-Means algorithm is applied to the dataset with the number of clusters varying between 2 and 7. The countries are colored based on the cluster they are assigned to and presented on the choropleth below. The number of clusters can be changed from the dropdown. With two clusters, the algorithm pits Turkey against the rest of Europe. With seven clusters, the countries separate broadly into central Europe, the Mediterranean, the East, and Scandinavia, with Turkey and Hungary in clusters of their own.
# K-Means requires the data to be scaled to zero mean and unit variance
# so that Euclidean distances are comparable across variables.
data_scaled <- scale(data) %>% as.data.frame()
# Country code is combined to plot on map
country_code <- read_csv("country_code.csv") %>%
rename(country = "COUNTRY", code = "CODE")
data_kmeans <- data_scaled %>% rownames_to_column("country") %>%
left_join(country_code, by = "country")
# Combining the results of K-Means algorithm
data_kmeans$k <- 0
data_kmeans$cluster <- 0
# Into geo_data
geo_data <- NULL
for (i in 2:7) {
kmeans_model <- kmeans(data_scaled, i, nstart = 25)
data_kmeans$cluster <- kmeans_model$cluster
data_kmeans$k <- i
geo_data <- geo_data %>% bind_rows(data_kmeans)
}
# Drop unnecessary columns
geo_data <- geo_data %>% select(country, code, k, cluster)
# Change data structure for Plotly
geo_data_played <- geo_data %>% spread(k, cluster, sep = "value")
geo_colnames <- colnames(geo_data_played)
# Plotly choropleth settings
l <- list(color = toRGB("grey"), width = 0.5)
g <- list(showframe = FALSE,
# lonaxis = list(range = c(-20, 50)),
# lataxis = list(range = c(30, 90)),
lonaxis = list(range = c(-10, 50)),
lataxis = list(range = c(30, 80)),
showcoastlines = FALSE,
projection = list(type = 'Mercator'))
# Build plot with data for two clusters visible
geo_plot <- geo_data_played %>% plot_geo() %>%
add_trace(z = ~kvalue2, color = ~kvalue2,
text = ~country, locations = ~code,
marker = list(line = l), showscale = FALSE) %>%
colorbar(title = "Scale") %>%
layout(geo = g,
title = "Countries grouped based on employment",
annotations = list(x = 0, y = 1, text = "Change clusters in dropdown",
showarrow = FALSE, xref = "paper", yref = "paper"))
# Add plots for clusters 3:7 invisible
for (col in geo_colnames[4:8]) {
geo_plot <- geo_plot %>%
add_trace(z = geo_data_played[[col]], color = geo_data_played[[col]],
text = ~country, locations = ~code,
marker = list(line = l), showscale = FALSE, visible = FALSE)
}
# Add dropdown for clusters and set corresponding visibility
geo_plot %>%
layout(
updatemenus = list(
list(y = 0.9, x = 0.08,
buttons = list(
list(method = "restyle",
args = list("visible", list(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)),
label = "2"),
list(method = "restyle",
args = list("visible",list(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)),
label = "3"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)),
label = "4"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)),
label = "5"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)),
label = "6"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)),
label = "7")
)))) %>%
config(displayModeBar = FALSE)

We clearly see patterns emerging that we didn't know existed before. Countries that are geographically and culturally closer had similar employment patterns. Unsupervised learning thus helped us draw inferences without a labelled response in the input dataset.
Determining the optimal number of clusters
Because the data was on a map, it was easier for us to say that 7 clusters grouped the data better than 2. But in other cases where the output may not be easily interpretable, it becomes essential to be definite about the number of clusters. There are three major methods to determine K.
Elbow - The number of clusters is plotted against the within-cluster sum of squares, and the point where the curve bends sharply (the "elbow") is considered optimal.
Average Silhouette - The silhouette value is a measure of how similar a data point is to its own cluster compared to other clusters. The optimal number of clusters is the one that maximizes the average silhouette.
Gap Statistic - This method computes the within-cluster variation and compares it against that of a randomly generated reference dataset. The K that gives the largest gap is considered optimal.
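To make the elbow method concrete: the curve that fviz_nbclust(method = "wss") draws below is simply kmeans()$tot.withinss evaluated over a range of K. A self-contained base-R sketch on made-up toy data (three Gaussian blobs standing in for the scaled employment data):

```r
set.seed(1804)

# Toy stand-in data: three well-separated Gaussian blobs
x <- rbind(cbind(rnorm(30, 0), rnorm(30, 0)),
           cbind(rnorm(30, 6), rnorm(30, 6)),
           cbind(rnorm(30, 0), rnorm(30, 12)))

# Total within-cluster sum of squares for K = 1..8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(1:8, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
# The curve drops sharply up to K = 3 (the true number of blobs)
# and flattens afterwards: that bend is the "elbow".
```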
plot1 <- fviz_nbclust(data_scaled, kmeans, method = "wss") +
ggtitle("By Elbow Method") + xlab("") + theme(text = element_text(size = 8))
plot2 <- fviz_nbclust(data_scaled, kmeans, method = "silhouette") +
ggtitle("By Silhouette Method") + xlab("") + theme(text = element_text(size = 8))
set.seed(1804)
gap_stat <- clusGap(data_scaled, FUN = kmeans, nstart = 25, K.max = 12, B = 50)
plot3 <- fviz_gap_stat(gap_stat) +
ggtitle("By Gap Statistic") + xlab("") + theme(text = element_text(size = 8))
grid.arrange(plot1, plot2, plot3, nrow = 2, ncol = 2)

The result of the Elbow method is ambiguous, while three clusters have the highest average silhouette. The gap statistic method suggests one cluster as the optimum, which is impractical.
Going by the Silhouette method, three clusters broadly separate the data into western and eastern Europe, with Turkey and Yugoslavia forming the third cluster. This seems fair, and the rest of the document will work with three clusters.
data$cluster <- kmeans(data_scaled, 3, nstart = 25)$cluster %>% as.numeric()
data %>% rownames_to_column("country") %>%
ggplot(aes(x = reorder(country, cluster), color = as.factor(cluster))) +
geom_text(aes(y = 1),label = rownames(data)) +
coord_flip() + theme_minimal() +
theme(legend.position = "none", axis.ticks = element_blank(), axis.text = element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Merging Results with Principal Components
A PCA model is applied to the data and the results are printed below. The first three components explain ~76% of the variance, so we will work with only these three components, which also makes plotting possible.
# Exclude the cluster label so PCA runs on the employment shares only
pca_model <- princomp(data[, names(data) != "cluster"], cor = TRUE) %>% summary(loadings = TRUE)
eigen_values <- pca_model$sdev^2
pca_model$sdev %>% as.data.frame() %>% rownames_to_column("component") %>%
bind_cols(eigen_value = round(eigen_values, 3),
proportion = round(eigen_values/sum(eigen_values), 3),
cumulative = round(cumsum(eigen_values)/sum(eigen_values), 3)) %>%
select(component, eigen_value, proportion, cumulative) %>%
datatable(caption = "Results of Principal Component Analysis")

pca_model$scores %>% as.data.frame() %>% rownames_to_column("country") %>%
bind_cols(data) %>% select(Comp.1, Comp.2, Comp.3, cluster, country) %>%
rename("component_1" = Comp.1, "component_2" = Comp.2, "component_3" = Comp.3) %>%
plot_ly(x = ~component_1, y = ~component_2, z = ~component_3, color = ~cluster,
hoverinfo = "text",
text = ~paste(country, "\n",
"Component 1 =", round(component_1, 3),"\n",
"Component 2 =", round(component_2, 3),"\n",
"Component 3 =", round(component_3, 3),"\n")) %>%
add_markers() %>% hide_colorbar() %>%
layout(scene = list(xaxis = list(title = 'Component 1'),
yaxis = list(title = 'Component 2'),
zaxis = list(title = 'Component 3')))

While PCA assigns each country to a point in 3-dimensional space, the role of K-Means is only to assign each country to one of the K clusters. Thus a K-Means model with n clusters, where n is the number of data points, will essentially be a PCA.
This goes to show that PCA is one step on top of K-Means.
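This claim can be checked directly: with K equal to the number of data points, K-Means reproduces the data exactly (zero within-cluster sum of squares), just as PCA with all components kept reconstructs the data exactly. A small base-R sketch on made-up data:

```r
set.seed(1804)
x <- matrix(rnorm(60), ncol = 2)  # 30 toy data points

# K-Means with one cluster per data point: every point becomes its own
# centroid, so the within-cluster sum of squares collapses to zero
fit <- kmeans(x, centers = nrow(x), algorithm = "Lloyd")
fit$tot.withinss  # 0

# PCA keeping all components: the reconstruction is exact
pca <- prcomp(x)
recon <- sweep(pca$x %*% t(pca$rotation), 2, pca$center, "+")
max(abs(recon - x))  # ~0 (floating point noise)
```

Both techniques trade exactness for compression: PCA by truncating components, K-Means by collapsing points onto a small number of centroids.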
As a comparison, K-Means can also be run on just the first two principal components, and the resulting clusters plotted side by side with clusters fit on the full scaled data.

pca_kmeans_model <- pca_model$scores %>% as.data.frame() %>% select(Comp.1, Comp.2) %>%
kmeans(centers = 7, nstart = 25)
fviz_cluster(pca_kmeans_model, data = pca_model$scores[, 1:2] %>% as.data.frame()) +
theme_minimal() + ggtitle("K-Means on the first two principal components")

kmeans_model <- kmeans(data_scaled, 7, nstart = 25)
fviz_cluster(kmeans_model, data = data_scaled) + theme_minimal() +
ggtitle("K-Means on the scaled data")

K-Means in Matrix Form
The K-Means algorithm is essentially an optimization problem whose objective is to find centroids such that the residual sum of squares is minimized. The only constraint is that each data point can belong to only one cluster.
In other words, K-Means and PCA optimize the same objective function, with the only difference being that K-Means has an additional "categorical" constraint.
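The matrix form can be verified numerically. Writing Z for the n-by-K cluster membership indicator matrix (one 1 per row, which is the "categorical" constraint) and M for the K-by-p matrix of centroids, the K-Means objective is the squared Frobenius norm of X - ZM, and it matches kmeans()$tot.withinss exactly. A sketch on made-up data:

```r
set.seed(1804)
x <- matrix(rnorm(60), ncol = 2)  # 30 toy data points
k <- 3
fit <- kmeans(x, centers = k, nstart = 25)

# Z: n-by-k indicator matrix, Z[i, j] = 1 iff point i is in cluster j
Z <- diag(k)[fit$cluster, ]
# M: k-by-p matrix of centroids
M <- fit$centers

# Squared Frobenius norm of the residual X - ZM
frob <- sum((x - Z %*% M)^2)
all.equal(frob, fit$tot.withinss)  # TRUE
```

Relaxing Z from a 0/1 indicator matrix to any orthonormal matrix is what turns this objective into the PCA problem.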
In short, PCA can be viewed as a relaxed version of K-Means: dropping the requirement that each data point belong to exactly one cluster turns the K-Means problem into PCA. This Quora discussion walks through the matrix algebra:
https://www.quora.com/What-is-the-relationship-between-K-means-and-PCA
Finally, the clusters can be profiled by plotting the average employment share of each industry per cluster.
kmeans_model <- kmeans(data_scaled, 5, nstart = 25)
fviz_cluster(kmeans_model, data = data_scaled) + theme_minimal() +
ggtitle("K-Means with five clusters")

data$cluster <- kmeans_model$cluster
histo_viz <- list()
for (i in colnames(data_scaled)) {
histo_viz[[i]] <- data %>% group_by(cluster) %>%
summarize(mean = mean(.data[[i]])) %>%
ggplot(aes(cluster, mean)) + geom_col() + coord_flip() +
ylab(paste("Average", i)) + xlab("Cluster")
}
grid.arrange(grobs = histo_viz, nrow = 3, ncol = 3,
top = "Average employment share by cluster")